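A simple example that sets up a Solr cluster and collection, uploads documents to it, trains a ranker, and then queries the collection with and without the ranker.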
Note: To better understand the impact of querying with and without a ranker, see the next example on evaluation.
We use the InsuranceLibV2 data: https://github.com/shuzi/insuranceQA. At a high level, InsuranceLibV2 is a question answering data set provided for benchmarking and research; it consists of questions and answers collected from the Insurance Library.
Coverage follows the car. Example 1: if you were given a car (loaned) and the car has no insurance, you can buy insurance on the car and your insurance will be primary. Another option, someone helped you to buy a car. For example your credit score isn't good enough to finance, so a friend of yours signed under your loan as a primary payor. You can get insurance under your name and even list your friend on the policy as a loss payee. In this case, we always suggest you get a loan gap coverage: the difference between the car's actual cash value and the amount still owned on it. Example 2: the car you are loaned has insurance. You can buy a policy under your name, list the car on that policy and in case of the accident, your policy will become a secondary or excess. Once the limits of the primary car insurance are exhausted, your coverage would kick in and hopefully pay for the rest. I specifically used the word hopefully, because each accident is unique and it's hard to interpret the coverage without the actual claim scenario. And even with a given claim scenario, sometimes there are 2 possible outcomes of a claim.
Does auto insurance go down when you turn 21?
Note: Ensure credentials have been updated in config/config.ini
In [1]:
import sys
from os import path, getcwd
import json
from tempfile import mkdtemp
sys.path.extend([path.abspath(path.join(getcwd(), path.pardir))])
from rnr_debug_helpers.utils.rnr_wrappers import RetrieveAndRankProxy, RankerProxy
from rnr_debug_helpers.utils.io_helpers import load_config, smart_file_open, RankerRelevanceFileQueryStream
from rnr_debug_helpers.generate_rnr_feature_file import generate_rnr_features
config_file_path = path.abspath(path.join(getcwd(), path.pardir, 'config', 'config.ini'))
print('Using config from {}'.format(config_file_path))
config = load_config(config_file_path=config_file_path)
insurance_lib_data_dir = path.abspath(path.join(getcwd(), path.pardir, 'resources', 'insurance_lib_v2'))
print('Using data from {}'.format(insurance_lib_data_dir))
In [4]:
# Either re-use an existing Solr cluster id by overriding the value below, or leave it as None to create a new cluster
cluster_id = None
# If you choose to leave it as None, it'll use these details to request a new cluster
cluster_name = 'Test Cluster'
cluster_size = '2'
bluemix_wrapper = RetrieveAndRankProxy(solr_cluster_id=cluster_id,
                                       cluster_name=cluster_name,
                                       cluster_size=cluster_size,
                                       config=config)
In [5]:
collection_id = 'TestCollection'
config_id = 'TestConfig'
zipped_solr_config = path.join(insurance_lib_data_dir, 'config.zip')
bluemix_wrapper.setup_cluster_and_collection(collection_id=collection_id, config_id=config_id,
                                             config_zip=zipped_solr_config)
The InsuranceLibV2 data had to be pre-processed and converted into the Solr XML format for adding documents.
TODO: show the scripts for how to do this conversion to solr format from the raw data provided at https://github.com/shuzi/insuranceQA.
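Until the actual conversion scripts are shown, here is a minimal sketch of what such a conversion could look like. It assumes each document has an id and a body (the two fields read back in the query cell later on) and writes them as a standard Solr `<add>` update message. The function name and the sample records are hypothetical and are not taken from the real corpus or from rnr_debug_helpers.

```python
import xml.etree.ElementTree as ET


def to_solr_add_xml(records):
    """Build a Solr <add> update message from (doc_id, body) pairs."""
    add = ET.Element('add')
    for doc_id, body in records:
        doc = ET.SubElement(add, 'doc')
        # Field names 'id' and 'body' are assumed to match the Solr schema in config.zip
        ET.SubElement(doc, 'field', name='id').text = str(doc_id)
        ET.SubElement(doc, 'field', name='body').text = body
    # ElementTree escapes characters such as '&' and '<' in the text for us
    return ET.tostring(add, encoding='unicode')


# Hypothetical records for illustration only
sample_xml = to_solr_add_xml([('1', 'Coverage follows the car.'),
                              ('2', 'Rates & age')])
print(sample_xml)
```

Building the XML with ElementTree rather than string formatting avoids malformed output when an answer contains characters that need escaping.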
In [7]:
documents = path.join(insurance_lib_data_dir, 'document_corpus.solr.xml')
print('Uploading from: %s' % documents)
bluemix_wrapper.upload_documents_to_collection(collection_id=collection_id, corpus_file=documents,
                                               content_type='application/xml')
print('Uploaded %d documents to the collection' %
      bluemix_wrapper.get_num_docs_in_collection(collection_id=collection_id))
Since we already have annotated queries with the ids of the documents that are relevant to each of them, we can use them to train a ranker.
TODO: show the scripts for how to do this conversion to the relevance file format from the raw data provided at https://github.com/shuzi/insuranceQA.
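To make the relevance file less abstract, here is a rough sketch of reading one. It assumes each CSV row holds the query text followed by alternating document id and relevance label columns; that layout is an assumption, so check it against the actual validation_gt_relevance_file.csv. The function name and the document ids in the sample are hypothetical.

```python
import csv
import io


def parse_relevance_rows(handle):
    """Yield (query, {doc_id: relevance_label}) for each non-empty CSV row."""
    for row in csv.reader(handle):
        if not row:
            continue
        query, rest = row[0], row[1:]
        # Assumed layout: doc id and label columns always come in pairs
        if len(rest) % 2 != 0:
            raise ValueError('Expected doc id / label pairs for query: %r' % query)
        labels = {rest[i]: int(rest[i + 1]) for i in range(0, len(rest), 2)}
        yield query, labels


# Hypothetical rows for illustration only
sample = io.StringIO(
    'does auto insurance go down when you turn 21,doc_12,1\n'
    'can i add my brother to my health insurance,doc_7,1,doc_9,1\n'
)
parsed = list(parse_relevance_rows(sample))
for query, labels in parsed:
    print(query, labels)
```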
The ranker trains on features derived from the questions and their candidate answers, so we first need to use the service to generate a feature file. During feature file generation, we need to decide on the num_rows parameter. We will go into this in more detail in a separate example; for now, we set it to 50.
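One way to build intuition for num_rows is the toy sketch below. It assumes num_rows caps how many candidate documents are retrieved per query, so a relevant document ranked below that cutoff never reaches the ranker. The function and the data are hypothetical and are not part of rnr_debug_helpers.

```python
def fraction_with_relevant_in_top_n(retrieved, relevant, num_rows):
    """Fraction of queries with at least one relevant doc in the first num_rows results."""
    if not retrieved:
        return 0.0
    hits = 0
    for query, doc_ids in retrieved.items():
        # Only the first num_rows candidates would be available to the ranker
        if set(doc_ids[:num_rows]) & relevant.get(query, set()):
            hits += 1
    return hits / len(retrieved)


# Hypothetical retrieval output (best first) and ground truth
retrieved = {'q1': ['d3', 'd1', 'd7'], 'q2': ['d4', 'd5', 'd2']}
relevant = {'q1': {'d1'}, 'q2': {'d2'}}

for n in (1, 2, 3):
    print(n, fraction_with_relevant_in_top_n(retrieved, relevant, n))
# prints:
# 1 0.0
# 2 0.5
# 3 1.0
```

A larger num_rows gives the ranker more chances to see a relevant document, at the cost of a bigger feature file and more candidates to score per query.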
In [7]:
collection_id = 'TestCollection'
# Id of the Solr cluster that holds the collection; replace this with the id of your own cluster
cluster_id = 'sc40bbecbd_362a_4388_b61b_e3a90578d3b3'
temporary_output_dir = mkdtemp()
feature_file = path.join(temporary_output_dir, 'ranker_feature_file.csv')
print('Saving file to: %s' % feature_file)
num_rows = 50
with smart_file_open(path.join(insurance_lib_data_dir, 'validation_gt_relevance_file.csv')) as infile:
    query_stream = RankerRelevanceFileQueryStream(infile)
    with smart_file_open(feature_file, mode='w') as outfile:
        stats = generate_rnr_features(collection_id=collection_id, cluster_id=cluster_id, num_rows=num_rows,
                                      in_query_stream=query_stream, outfile=outfile, config=config)
print(json.dumps(stats, sort_keys=True, indent=4))
In [8]:
ranker_api_wrapper = RankerProxy(config=config)
ranker_name = 'TestRanker'
ranker_id = ranker_api_wrapper.train_ranker(train_file_location=feature_file, train_file_has_answer_id=True,
                                            is_enabled_make_space=True, ranker_name=ranker_name)
ranker_api_wrapper.wait_for_training_to_complete(ranker_id=ranker_id)
# Delete local feature file since ranker training is done
from shutil import rmtree
rmtree(temporary_output_dir)
In [23]:
query_string = 'can i add my brother to my health insurance '
def print_results(response, num_to_print=3):
    results = json.loads(response)['response']['docs']
    for i, doc in enumerate(results[0:num_to_print]):
        print('Result {}:\n\tid: {}\n\tbody:{}...'.format(i+1,doc['id'], " ".join(doc['body'])[0:100]))

bluemix_wrapper = RetrieveAndRankProxy(solr_cluster_id="sc40bbecbd_362a_4388_b61b_e3a90578d3b3",
                                       config=config)
print('Querying with: {}'.format(query_string))
# without the ranker
pysolr_client = bluemix_wrapper.get_pysolr_client(collection_id=collection_id)
response = pysolr_client._send_request("GET", path="/fcselect?q=%s&wt=json&rows=3" % query_string)
print("\nWithout Ranker")
print_results(response)
# with ranker
pysolr_client = bluemix_wrapper.get_pysolr_client(collection_id=collection_id)
response = pysolr_client._send_request("GET", path="/fcselect?q=%s&wt=json&rows=%d&ranker_id=%s" %
                                       (query_string, num_rows, ranker_id))
print("\nWith Ranker")
print_results(response)
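The request paths above interpolate the raw query string, including its spaces, directly into the URL. As a possible refinement, the sketch below builds the same /fcselect path with URL-encoded parameters using the Python 3 standard library. The helper name build_fcselect_path and the ranker id in the example are hypothetical; the parameter names (q, wt, rows, ranker_id) are the ones used in the cell above.

```python
from urllib.parse import urlencode


def build_fcselect_path(query, rows, ranker_id=None):
    """Return an /fcselect path with URL-encoded query parameters."""
    params = [('q', query.strip()), ('wt', 'json'), ('rows', rows)]
    # Only ask for re-ranking when a ranker id is supplied
    if ranker_id is not None:
        params.append(('ranker_id', ranker_id))
    return '/fcselect?' + urlencode(params)


query = 'can i add my brother to my health insurance '
print(build_fcselect_path(query, rows=3))
# hypothetical ranker id, for illustration only
print(build_fcselect_path(query, rows=50, ranker_id='hypothetical-ranker-id'))
```

The resulting string could be passed as the path argument in place of the hand-formatted one, which also keeps the with-ranker and without-ranker requests identical apart from the ranker_id parameter.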